Extracting Persian-English Parallel Sentences from Document Level Aligned Comparable Corpus using Bi-Directional Translation

Authors

  • Ebrahim Ansari
  • Mohammad Hadi Sadreddini
  • Alireza Tabebordbar
  • Richard Wallace
Abstract

Bilingual parallel corpora are very important in various fields of natural language processing (NLP). The quality of a Statistical Machine Translation (SMT) system depends strongly on the amount of training data. For low-resource language pairs such as Persian-English, there are not enough parallel sentences to build an accurate SMT system. This paper describes a new approach that uses Wikipedia as a comparable corpus to extract Persian-English parallel sentences and thereby improve SMT system performance. The approach is also applicable to other low-resource language pairs. To calculate the similarity score between two sentences, a novel bi-directional translation-based information retrieval system is proposed. A length penalty score is introduced to increase the accuracy of the extracted corpus. Using the extracted parallel sentences, the performance of an existing Persian-English SMT system is improved drastically.
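
The abstract does not give the scoring formula, so the following is only a minimal sketch of how a bi-directional, translation-based similarity score with a length penalty might be put together. Everything specific here is an assumption made for illustration: the Jaccard word overlap, the averaging of the two directions, the length-ratio penalty, the toy lexicons `FA_EN` and `EN_FA`, and all function names. None of it is taken from the paper.

```python
from typing import Callable, Dict, List

# Toy word-by-word lexicons (hypothetical; a real system would use an MT model or dictionary).
FA_EN: Dict[str, str] = {"این": "this", "کتاب": "book", "خوب": "good", "است": "is"}
EN_FA: Dict[str, str] = {en: fa for fa, en in FA_EN.items()}


def tokens(sentence: str) -> List[str]:
    """Lowercase and split on whitespace."""
    return sentence.lower().split()


def translate(sentence: str, lexicon: Dict[str, str]) -> str:
    """Word-by-word lookup; unknown words are passed through unchanged."""
    return " ".join(lexicon.get(tok, tok) for tok in tokens(sentence))


def fa_to_en(sentence: str) -> str:
    return translate(sentence, FA_EN)


def en_to_fa(sentence: str) -> str:
    return translate(sentence, EN_FA)


def overlap(a: List[str], b: List[str]) -> float:
    """Jaccard overlap of two token lists (assumed similarity measure)."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def length_penalty(n_src: int, n_tgt: int) -> float:
    """Ratio of shorter to longer length; 1.0 when lengths match (assumed form)."""
    if n_src == 0 or n_tgt == 0:
        return 0.0
    return min(n_src, n_tgt) / max(n_src, n_tgt)


def bidirectional_score(
    src: str,
    tgt: str,
    src_to_tgt: Callable[[str], str],
    tgt_to_src: Callable[[str], str],
) -> float:
    """Average the forward and backward translation overlaps, then apply the length penalty."""
    src_tok, tgt_tok = tokens(src), tokens(tgt)
    forward = overlap(tokens(src_to_tgt(src)), tgt_tok)
    backward = overlap(tokens(tgt_to_src(tgt)), src_tok)
    return 0.5 * (forward + backward) * length_penalty(len(src_tok), len(tgt_tok))


parallel = bidirectional_score("این کتاب خوب است", "this book is good", fa_to_en, en_to_fa)
unrelated = bidirectional_score("این کتاب خوب است", "the weather is cold today", fa_to_en, en_to_fa)
```

In this sketch a sentence pair would be kept as parallel only if its score exceeds some threshold, which would need tuning on held-out data. Scoring in both directions is meant to reduce the effect of errors in either translation direction.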

Similar resources

Extracting an English-Persian Parallel Corpus from Comparable Corpora

Parallel data are an important part of a reliable Statistical Machine Translation (SMT) system. The more of these data are available, the better the quality of the SMT system. However, for some language pairs such as Persian-English, parallel sources of this kind are scarce. In this paper, a bidirectional method is proposed to extract parallel sentences from English and Persian document aligned...

Using English as Pivot to Extract Persian-Italian Parallel Sentences from Non-Parallel Corpora

Ebrahim Ansari et al. 2017. Using English as pivot to extract Persian-Italian parallel sentences from non-parallel corpora. In "Applications of Comparable Corpora", edited book, Berlin Linguistic Press (ed.). The effectiveness of a statistical machine translation (SMT) system depends heavily on the amount of parallel corpus used in the training phase. For low-resource l...

Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. The question now arises: do these results extend to other languages? Since the goal of document clustering is grouping of docum...

Creating a Persian-English Comparable Corpus

Multilingual corpora are valuable resources for cross-language information retrieval and are available in many language pairs. However, the Persian language does not have rich multilingual resources, owing to some of its special features and to difficulties in constructing such corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in Englis...

Developing Bilingual Plagiarism Detection Corpus Using Sentence Aligned Parallel Corpus: Notebook for PAN at CLEF 2015

Plagiarism detection is the process of locating text reuse within a suspicious document. Plagiarism detection corpora are used for evaluating plagiarism detection systems. In this paper, we present a bilingual Persian-English plagiarism detection corpus. We provide our corpus for the task of text alignment corpus construction in the PAN 2015 competition. Our approach is based on parallel cor...

Publication date: 2014